FindZebra online search delving into rare disease case reports using natural language processing

Early diagnosis is crucial for the well-being and quality of life of rare disease patients. Access to the most complete knowledge about diseases through intelligent user interfaces can play an important role in supporting the physician in reaching the correct diagnosis. Case reports may offer information about heterogeneous phenotypes, which often further complicate rare disease diagnosis. The rare disease search engine FindZebra.com is extended to also access case report abstracts extracted from PubMed for several diseases. A search index for each disease is built in Apache Solr, adding age, sex and clinical features extracted using text segmentation to enhance the specificity of search. Clinical experts performed retrospective validation of the search engine, utilising real-world Outcomes Survey data on Gaucher and Fabry patients. Medical experts evaluated the search results as being clinically relevant for the Fabry patients and less clinically relevant for the Gaucher patients. The shortcomings for Gaucher patients mainly reflect a mismatch between the current understanding and treatment of the disease and how it is reported in PubMed, notably in the older case reports. In response to this observation, a filter for the publication date was added in the final version of the tool, available from deep.findzebra.com/ followed by the disease identifier: gaucher, fabry or hae (hereditary angioedema).
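As an illustration of how a patient profile could be turned into a query against such a per-disease index, the following is a minimal sketch, not the FindZebra implementation. It only assembles standard Solr request parameters (`q`, `defType`, `qf`, `fq`, `rows`). The field names `clinical_features`, `abstract`, `age`, `sex` and `publication_year`, the age window, and the choice to apply age and sex as hard filters rather than ranking signals are all assumptions made for the example.

```python
from urllib.parse import urlencode


def build_case_report_params(features, age=None, sex=None,
                             published_from=None, age_window=10, rows=20):
    """Assemble Solr select parameters for a case report search.

    All field names below are hypothetical; the real index schema
    is not described in the text.
    """
    params = [
        ("q", " ".join(features)),              # free-text clinical features
        ("defType", "edismax"),                 # Solr's extended DisMax parser
        ("qf", "clinical_features abstract"),   # hypothetical searchable fields
        ("rows", str(rows)),
    ]
    if age is not None:
        low = max(age - age_window, 0)
        # Range filter: keep reports on patients of roughly the same age.
        params.append(("fq", f"age:[{low} TO {age + age_window}]"))
    if sex is not None:
        params.append(("fq", f"sex:{sex}"))
    if published_from is not None:
        # Publication date filter, open-ended towards recent reports.
        params.append(("fq", f"publication_year:[{published_from} TO *]"))
    return params


if __name__ == "__main__":
    params = build_case_report_params(
        ["splenomegaly", "bone pain"], age=34, sex="female", published_from=2000
    )
    print("/select?" + urlencode(params))
```

Expressing the publication date as a filter query keeps it independent of relevance scoring, so older case reports are excluded outright instead of merely ranked lower.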

(2) Introduction:
(2a) The combined use of already existing models should be added when mentioning the two components for the tool (p. 3). Otherwise, the impression is created that the components were completely newly developed by the authors. Answer: Good point. We have clarified which tools already existed and which are new.
(2b) The goal of the study is stated as the evaluation of the tool (p. 3). However, the title and the methods also describe the functional area of the extension. Therefore, the research objective (and possibly also the title) should be modified accordingly. Answer: Good point. This has been clarified along the lines under point (1). We are open to changing the title. Suggestions are welcome.
(3) Material and methods:
(3a) When explaining the workflow, only the goal is named and not the method of functioning. A more detailed explanation, especially of Figure 2, is necessary here (p. 3). Answer: Good point. Done.
(4) Results:
(4a) The description for "Search index" should focus more on actual results and less on the procedure (p. 7). Answer: Good point. We have moved this subsection to the Search engine subsection in Methods.
(4b) It is not always clear which results were obtained by which methods; especially with respect to the validation process, it would be interesting to know which results came from which step. Here, a reference to the individual steps would be important. (An example of this is the evaluation of the previous FindZebra tool (p. 8).) Answer: We have clarified this.
MINOR ISSUES
(5) Introduction:
(5a) The transition from the explanations of the diseases to the explanation of PubMed could seem more natural by adding the research gap again (p. 2). Answer: Yes, agreed. We have augmented the text here.
(5b) The claim that a tool is missing is not substantiated, so perhaps add "to our knowledge" (p. 3). Answer: Added.
(5c) Incomplete sentence: "Case reports could be used to improve the clinical management of today's patients, for instance by tailoring the treatment to the patient profile, or by supporting the healthcare providers in their [16,17]

Reviewer #2
The manuscript describes a new search functionality for rare disease case reports in the FindZebra.com service. Two rare diseases, Gaucher and Fabry, were used as the exemplar study diseases. Natural language processing models/tools were utilised to support the modelling/indexing of case reports and the matching between user queries and case reports. In particular, user queries were generated from 'real' patient cases including age/gender and clinical features such as symptoms and phenotypes. Evaluation protocols and metrics were proposed to validate the utility of the service in supporting clinical decision making for patients with those diseases, seemingly in scenarios of both diagnosis and treatment.
Overall, from a clinical utility point of view, this would potentially be a very valuable work and a much needed service for supporting rare disease diagnosis, treatment and management. However, technically, from an information retrieval and NLP point of view, the work requires further development and clarification to make it publishable, in other words, to make a substantial contribution to the field and be useful for the community.
1. It is not clear how NLP models were used and developed. Named entity recognition (NER) was mentioned only in the abstract. In the main text and supplementary material it was called segmentation. The two might be totally different NLP tasks. The use of terminology aside, there is no information on how the NER was done.
PubMedBERT was mentioned as the language model fine-tuned for the segmentation task (assuming this is the NER for clinical features such as symptoms). However, there was no mention of where the ground truth for NER came from. Answer: Good point. We have substantially updated the description of the solution so that the terminology is unified and it becomes clearer how we labeled the data and used the BERT model. In particular, we removed mentions of the NER task, as we only explored the task of text segmentation. Note that we aimed at labeling longer chunks of text than is done in NER tasks. The main motivation was to automatically extract a structured patient profile.
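To make the distinction from NER concrete, the sketch below shows one way per-token segment labels could be merged into longer chunks and collected into a structured patient profile. It is a hedged illustration of the post-processing step only. The label names (`AGE`, `SEX`, `FEATURE`, `O`) and the helper functions are hypothetical, and the labels are written by hand here, whereas in practice they would be predicted by the fine-tuned model.

```python
from itertools import groupby


def merge_segments(tokens, labels, outside="O"):
    """Merge runs of identically labelled tokens into (label, text) chunks."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    segments = []
    # groupby yields one group per run of consecutive equal labels.
    for label, run in groupby(zip(tokens, labels), key=lambda pair: pair[1]):
        if label == outside:
            continue  # text outside any segment is dropped
        segments.append((label, " ".join(token for token, _ in run)))
    return segments


def to_profile(segments):
    """Collect chunks per label into a simple patient profile dictionary."""
    profile = {}
    for label, text in segments:
        profile.setdefault(label, []).append(text)
    return profile


if __name__ == "__main__":
    tokens = "A 34 year old woman presented with splenomegaly and bone pain".split()
    # Hand-written labels standing in for model predictions (assumption).
    labels = ["O", "AGE", "AGE", "AGE", "SEX", "O", "O",
              "FEATURE", "O", "FEATURE", "FEATURE"]
    print(to_profile(merge_segments(tokens, labels)))
```

Each chunk spans a full run of tokens sharing a label, so a multi-word description such as "bone pain" stays together as one profile entry.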